METHOD AND APPARATUS FOR SMOOTHING FUNDAMENTAL FREQUENCY
DISCONTINUITIES ACROSS SYNTHESIZED SPEECH SEGMENTS 



FIELD OF THE INVENTION 

The present invention relates to methods and systems for speech processing, and in 
particular for mitigating the effects of frequency discontinuities that occur when speech segments 
are concatenated for speech synthesis. 

DESCRIPTION OF RELATED ART 

Concatenating short segments of pre-recorded speech is a well-known method of 
synthesizing spoken messages. Telephone companies, for example, have long used this 
technique to speak numbers or other messages that may change as a result of user inquiry. 
Newer, more sophisticated systems can synthesize messages with nearly any content by 
concatenating speech segments of varying length. These systems, referred to herein as "text-to- 
speech" (TTS) systems, typically include pre-recorded databases of speech segments designed to 
include all possible sequences of fundamental speech sounds (referred to herein as "phones") of 
the language to be synthesized. However, it is often necessary to use several short segments 
from disjoint parts of the database to create a desired utterance. This desired utterance, i.e., the 
output of the TTS system, is referred to herein as the "target." 

Ideally, the original recordings cover not only phone sequences, but also a wide range of 
variation in the talker's fundamental frequency F 0 (also referred to as "pitch"). For databases of 
practical size, there are typically cases where it is necessary to abut segments which were not 
originally contiguous, and for which the F 0 is discontinuous where the segments join. Although 
such a discontinuity is almost always noticeable to some extent, it is particularly noticeable when 
it occurs in the middle of a strongly-voiced region of speech (e.g., vowels). 

The change in the fundamental frequency F 0 as a function of time (i.e., the F 0 contour) in 
human speech encodes both linguistic information and "para-linguistic" information about the 
talker's identity, state of mind, regional accent, etc. Speech synthesis systems must preserve the 
details of the F 0 contour if the speech is to sound natural, and if the original talker's identity and 
affect are to be preserved. Automatic creation of natural-sounding F 0 contours from first
principles is still a research topic, and no practical systems which sound completely natural have
[...] less is known about characterizing the F 0 contours of a [...]

Concatenation-based TTS systems that [...] length from a large database, and that select these
[...] are known in the art as unit-selection synthesizers. As the source database for such a
synthesizer is being built, it is typically [...] boundaries. The degree of vowel [...]
information is tabulated [...] made on the source speech of the energy and F 0 as functions of
[...] available during synthesis to aid in the selection of the most appropriate [...] structure,
the part of speech of its constituent words, the [...] analysis of a [...] rough idea of the
target F 0 contour, the duration of [...] and the energy in the speech to be synthesized can be
estimated.

The purpose of the unit-selection [...] of speech from the [...] often impossible to find exact
F 0 [...] sound better as a direct result of the smoothing [...] because the unit selection
component can relax the F 0 continuity constraint, and consequently [...] disrupt the emphasis
or semantics of the target utterance. [...] On the other hand, the fundamental frequency F 0
[...] cue to the [...]

SUMMARY OF THE INVENTION

[...] etc. All such [...] of the invention as defined in the claims.

[...] to the original recording, while [...] the segment boundaries. In one [...] by addition
of a linear function (i.e., a straight line of variable offset and slope) to the original F 0
contour of the segment. This disclosure [...] of linear functions to be added to the segments
comprising the synthetic utterance. This [...] changes in the slope of the original F 0 contours
[...] F 0 [...] short segments over long segments, because such changes are more likely to be
more noticeable in the longer segments.

The technique described herein preferably does not introduce smoothing of F 0 anywhere
except exactly at the segment boundary, and is much less likely to generate false [...]
than prior art alternatives such as global low-pass filtering or local linear interpolation.

The method and system described herein is robust enough to accommodate occasional
errors in the measurement of F 0, and consists of two primary components. The first component
robustly estimates the F 0 found in the original source data. The second component generates the
correction functions to match this measured F 0 across the speech segment boundaries.

According to one aspect, the invention comprises a method of smoothing fundamental
frequency discontinuities at boundaries of concatenated speech segments as defined in Claim 1.
Each speech segment is characterized by a segment fundamental frequency contour and
includes two or more frames. The method includes determining, for each speech segment, a
beginning fundamental frequency value and an ending fundamental frequency value. The
method further includes adjusting the fundamental frequency contour of each of the speech
segments according to a linear function calculated for each particular speech segment. The
parameters characterizing each linear function are selected according to the beginning
fundamental frequency value and the ending fundamental frequency value of the corresponding
speech segment.
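
By way of illustration only, the adjustment of a contour by a linear function might be sketched
as follows. The function name, the argument names and the use of 0.0 to mark unvoiced frames are
assumptions of this sketch and are not taken from the disclosure; the two offsets stand for the
changes wanted at the beginning and at the end of the segment.

```python
from typing import List, Sequence


def apply_linear_correction(times: Sequence[float], f0: Sequence[float],
                            delta_begin: float, delta_end: float) -> List[float]:
    """Add a straight line to an F0 contour.

    The line shifts the first frame by delta_begin Hz and the last frame by
    delta_end Hz; frames in between are shifted by the interpolated amount.
    """
    if len(times) != len(f0) or len(f0) < 2:
        raise ValueError("need two or more frames with matching time stamps")
    t0, t1 = times[0], times[-1]
    if t1 <= t0:
        raise ValueError("time stamps must increase across the segment")
    slope = (delta_end - delta_begin) / (t1 - t0)
    # Assumption of this sketch: unvoiced frames carry f0 == 0.0 and are left alone.
    return [v + delta_begin + slope * (t - t0) if v > 0.0 else 0.0
            for t, v in zip(times, f0)]
```

Because only an offset and a slope are added, the fine detail of the original contour within the
segment is carried over unchanged.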

In one embodiment, the predetermined function includes a linear function. In another
embodiment, the predetermined function adjusts a slope associated with the speech segment. In
another embodiment, the predetermined function adjusts an offset associated with the speech
segment.

In another embodiment, the predetermined function calculated for each particular speech
segment is dependent upon a length associated with the speech segment, such that the
predefined function adjusts longer segments more than shorter segments. In other words, the
longer a segment is, the more significantly the predetermined function adjusts it.

Another embodiment further includes determining a number of parameters associated with each
speech segment. These parameters may include (i) a total duration of the segment, (ii) a total
duration of all voiced regions of the segment, (iii) an average value of the fundamental frequency
contour over all voiced regions of the segment, (iv) a median value of the fundamental frequency
contour over all voiced regions of the segment, and (v) a standard deviation of the fundamental
frequency contour over the whole segment. Combinations of these parameters, or other parameters
not listed may also be determined.

Another embodiment further includes setting the determined median value of the
fundamental frequency contour over all voiced regions of the segment to the average value of the
fundamental frequency contour over all voiced regions of the segment, if a number of
fundamental frequency samples in the speech segment is less than a predetermined value (i.e., a
[...]).

Another embodiment further includes examining a predetermined number of frames from
a beginning point of each speech segment, and setting the beginning fundamental frequency
value to a fundamental frequency value of the first frame, if all fundamental frequency values of
the predetermined number of frames from the beginning point of the speech segment are within a
predetermined range.

Another embodiment further includes examining a predefined number of frames from
an ending point of each speech segment, and setting the ending fundamental frequency value to a
fundamental frequency value of the last frame if all fundamental frequency values of the
predefined number of frames from the ending point of the speech segment are within a
predetermined range.
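
The endpoint test of the two preceding embodiments might be sketched as follows. This is a
minimal illustration under stated assumptions: the function name is hypothetical, unvoiced frames
are assumed to carry an F 0 of 0.0, and "within a predetermined range" is interpreted here as each
adjacent pair of frames differing by no more than a caller-supplied `max_step` in Hz.

```python
from typing import Optional, Sequence


def accept_endpoint(f0: Sequence[float], n_check: int, max_step: float,
                    from_end: bool = False) -> Optional[float]:
    """Return the endpoint F0 if the n_check frames nearest that endpoint agree.

    Returns None when the check fails, so that the caller can fall back to a
    more robust statistic such as a median or a mean.
    """
    if n_check < 1 or len(f0) < n_check:
        return None
    window = list(f0[-n_check:]) if from_end else list(f0[:n_check])
    # Assumption of this sketch: 0.0 marks an unvoiced frame.
    if any(v <= 0.0 for v in window):
        return None
    # Every adjacent pair of frames must differ by at most max_step.
    for a, b in zip(window, window[1:]):
        if abs(a - b) > max_step:
            return None
    return window[-1] if from_end else window[0]
```

Returning None rather than a number keeps the single-frame estimate clearly separate from any
fallback value chosen by the caller.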

Another embodiment further includes setting the beginning fundamental frequency and
the ending fundamental frequency of unvoiced speech segments to a value substantially equal to
a median value of the fundamental frequency contour over all voiced regions of a preceding
voiced segment.

Another embodiment further includes calculating, for each pair of adjacent speech
segments n and n+1, (i) a first ratio of the nth ending fundamental frequency value to the (n+1)th
beginning fundamental frequency value, (ii) a second ratio being the inverse of the first ratio, and
adjusting the nth ending fundamental frequency value and the (n+1)th beginning fundamental
frequency value, only if the first ratio and the second ratio are less than a predetermined ratio
threshold.

Another embodiment further includes calculating the linear function for each individual
speech segment according to a coupled spring model.

Another embodiment further includes implementing the coupled spring model such that a
first spring component couples the beginning fundamental frequency value to an anchor
component, a second spring component couples the ending fundamental frequency value to the
anchor component, and a third spring component couples the beginning fundamental frequency
value to the ending fundamental frequency value.

Another embodiment further includes associating a spring constant with the first spring
and the second spring such that the spring constant is proportional to a duration of voicing in the
associated speech segment.

Another embodiment further includes associating a spring constant with the third spring
such that the third spring models a non-linear restoring force that resists a change in slope of the
segment fundamental frequency contour.

Another embodiment further includes forming a set of simultaneous equations
corresponding to the coupled spring models associated with all of the concatenated speech
segments, and solving the set of simultaneous equations to produce the parameters characterizing
each linear function associated with one of the speech segments.

Another embodiment further includes solving the set of simultaneous equations through
an iterative algorithm based on Newton's method of finding zeros of a function.
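
The use of Newton's method can be illustrated on a deliberately reduced problem: a single
boundary between two segments, with the far end of each segment held fixed. This is only a sketch
and not the full set of simultaneous equations described above. The function name, the cubic form
chosen for the non-linear restoring force, and the constants `c_prev` and `c_next` are assumptions
of the sketch; `k_prev` and `k_next` stand for anchor spring constants, which the text makes
proportional to the duration of voicing.

```python
def solve_boundary_f0(end_prev: float, begin_next: float,
                      k_prev: float, k_next: float,
                      c_prev: float = 0.0, c_next: float = 0.0,
                      tol: float = 1e-9, max_iter: int = 50) -> float:
    """Find the shared boundary F0 at which the net spring force is zero.

    force(m) = k_prev*(m - end_prev)   + c_prev*(m - end_prev)**3
             + k_next*(m - begin_next) + c_next*(m - begin_next)**3
    Newton's method iterates m <- m - force(m) / force'(m).
    """
    if k_prev + k_next <= 0.0:
        raise ValueError("the two spring constants must sum to a positive value")
    m = 0.5 * (end_prev + begin_next)  # start midway between the two estimates
    for _ in range(max_iter):
        d1, d2 = m - end_prev, m - begin_next
        force = k_prev * d1 + c_prev * d1 ** 3 + k_next * d2 + c_next * d2 ** 3
        slope = k_prev + 3.0 * c_prev * d1 ** 2 + k_next + 3.0 * c_next * d2 ** 2
        step = force / slope
        m -= step
        if abs(step) < tol:
            break
    return m
```

With the cubic terms set to zero the result is the average of the two endpoint values weighted by
the spring constants, so the segment with the stiffer spring is moved less.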

In another aspect, the invention comprises a system for smoothing fundamental frequency
discontinuities at boundaries of concatenated speech segments as defined in Claim 18. Each
speech segment is characterized by a segment fundamental frequency contour and includes two
or more frames. The system includes a unit characterization processor for receiving the speech
segments and characterizing each segment with respect to the beginning fundamental frequency
and the ending fundamental frequency. The system further includes a fundamental frequency
adjustment processor for receiving the speech segments, the beginning fundamental frequency
and ending fundamental frequency. The fundamental frequency adjustment processor also
adjusts the fundamental frequency contour of each of the speech segments according to a linear
function calculated for each particular speech segment. The parameters characterizing each
linear function are selected according to the beginning fundamental frequency value and the
ending fundamental frequency value of the corresponding speech segment.

In another embodiment, the unit characterization processor determines a number of
parameters associated with each speech segment. These parameters may include (i) a total
duration of the segment, (ii) a total duration of all voiced regions of the segment, (iii) an average
value of the fundamental frequency contour over all voiced regions of the segment, (iv) a median
value of the fundamental frequency contour over all voiced regions of the segment, and (v) a
standard deviation of the fundamental frequency contour over the whole segment. Combinations
of these parameters, or other parameters not listed may also be determined.

In another embodiment, the unit characterization processor sets the determined median
value of the fundamental frequency contour over all voiced regions of the segment to the average
value of the fundamental frequency contour over all voiced regions of the segment, if a number
of fundamental frequency samples in the speech segment is less than a predetermined value.

In another embodiment, the unit characterization processor examines a predetermined
number of frames from a beginning point of each speech segment, and sets the beginning
fundamental frequency value to a fundamental frequency value of the first frame if all
fundamental frequency values of the predetermined number of frames from the beginning point
of the speech segment are within a predetermined range.

In another embodiment, the unit characterization processor examines a predetermined
number of frames from an ending point of each speech segment, and sets the ending fundamental
frequency value to a fundamental frequency value of the last frame if all fundamental frequency
values of the predetermined number of frames from the ending point of the speech segment are
within a predetermined range.

In another embodiment, the unit characterization processor sets the beginning
fundamental frequency and the ending fundamental frequency of unvoiced speech segments to a
value substantially equal to a median value of the fundamental frequency contour over all voiced
regions of a preceding voiced segment.

In another embodiment, the unit characterization processor calculates, for each pair of
adjacent speech segments n and n+1, (i) a first ratio of the nth ending fundamental frequency
value to the (n+1)th beginning fundamental frequency value, (ii) a second ratio being the inverse of
the first ratio, and adjusts the nth ending fundamental frequency value and the (n+1)th beginning
fundamental frequency value only if the first ratio and the second ratio are less than a
predetermined ratio threshold.
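
The ratio test can be written compactly. The sketch below is illustrative only; the function name
is hypothetical, and the default threshold of 1.8 is the MAX_RATIO value given for the preferred
embodiment.

```python
def should_smooth(f02_prev: float, f01_next: float, max_ratio: float = 1.8) -> bool:
    """Decide whether the boundary between segments n and n+1 should be smoothed.

    f02_prev is the ending F0 of segment n and f01_next is the beginning F0 of
    segment n+1. Both the ratio and its inverse must be below max_ratio.
    """
    if f02_prev <= 0.0 or f01_next <= 0.0:
        return False  # no usable F0 estimate on one side of the boundary
    first_ratio = f02_prev / f01_next
    second_ratio = 1.0 / first_ratio
    return first_ratio < max_ratio and second_ratio < max_ratio
```

Checking both the ratio and its inverse makes the test symmetric, so a large jump is rejected
whether F 0 rises or falls across the boundary.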

In another embodiment, the fundamental frequency adjustment processor calculates the
linear function for each individual speech segment according to a coupled spring model.

In another embodiment, the fundamental frequency adjustment processor implements the
coupled spring model such that a first spring component couples the beginning fundamental
frequency value to an anchor component, a second spring component couples the ending
fundamental frequency value to the anchor component, and a third spring component couples the
beginning fundamental frequency value to the ending fundamental frequency value.

In another embodiment, the fundamental frequency adjustment processor associates a
spring constant with the first spring and the second spring such that the spring constant is
proportional to a duration of voicing in the associated speech segment.

In another embodiment, the fundamental frequency adjustment processor associates a
spring constant with the third spring such that the third spring models a non-linear restoring force
that resists a change in slope of the segment fundamental frequency contour.

In another embodiment, the fundamental frequency adjustment processor forms a set of
simultaneous equations corresponding to the coupled spring models associated with all of the
concatenated speech segments, and solves the set of simultaneous equations to produce the
parameters characterizing each linear function associated with one of the speech segments.

In another embodiment, the fundamental frequency adjustment processor solves the set of
simultaneous equations through an iterative algorithm based on Newton's method of finding
zeros of a function.

In another aspect, the invention comprises a method of determining, for each of a series
of concatenated speech segments, a beginning fundamental frequency value and an ending
fundamental frequency value. Each speech segment is characterized by a segment fundamental
frequency contour and includes two or more frames. The method includes determining a
number of parameters associated with each speech segment. These parameters may include (i) a
total duration of the segment, (ii) a total duration of all voiced regions of the segment, (iii) an
average value of the fundamental frequency contour over all voiced regions of the segment, (iv) a
median value of the fundamental frequency contour over all voiced regions of the segment, and
(v) a standard deviation of the fundamental frequency contour over the whole segment. The
parameters may include combinations thereof, or other parameters not listed. The method further
includes setting the median value of the fundamental frequency contour over all voiced regions
of the segment to the average value of the fundamental frequency contour over all voiced regions
of the segment, if a number of fundamental frequency samples in the speech segment is less than a
predetermined value. The method further includes examining a predetermined number of
frames from a beginning point of each speech segment, and setting the beginning fundamental
frequency value to a fundamental frequency value of the first frame if all fundamental frequency
values of the predetermined number of frames from the beginning point of the speech segment
are within a predetermined range. The method further includes examining a predetermined
number of frames from an ending point of each speech segment, and setting the ending
fundamental frequency value to a fundamental frequency value of the last frame if all
fundamental frequency values of the predetermined number of frames from the ending point of
the speech segment are within a predetermined range. The method further includes setting the
beginning fundamental frequency and the ending fundamental frequency of unvoiced speech
segments to a value substantially equal to a median value of the fundamental frequency contour
over all voiced regions of a preceding voiced segment. The method further includes calculating,
for each pair of adjacent speech segments n and n+1, (i) a first ratio of the nth ending fundamental
frequency value to the (n+1)th beginning fundamental frequency value, (ii) a second ratio being the
inverse of the first ratio, and adjusting the nth ending fundamental frequency value and the (n+1)th
beginning fundamental frequency value only if the first ratio and the second ratio are less than a
predetermined ratio threshold.

In another aspect, the invention comprises a method of adjusting a fundamental frequency
contour of each of a series of concatenated speech segments according to a linear function
calculated for each particular speech segment. The parameters characterizing each linear
function are selected according to a beginning fundamental frequency value and an ending
fundamental frequency value of the corresponding speech segment. The method includes
calculating the linear function for each individual speech segment according to a coupled spring
model. The coupled spring model is implemented such that a first spring component couples the
beginning fundamental frequency value to an anchor component, a second spring component
couples the ending fundamental frequency value to the anchor component, and a third spring
component couples the beginning fundamental frequency value to the ending fundamental
frequency value. The method further includes forming a set of simultaneous equations
corresponding to the coupled spring models associated with all of the concatenated speech
segments, and solving the set of simultaneous equations to produce the parameters characterizing
each linear function associated with one of the speech segments.

A preferred embodiment provides a method of determining, for each of a series of
concatenated speech segments, a beginning fundamental frequency value and an ending
fundamental frequency value, each speech segment characterized by a segment fundamental
frequency contour and including two or more frames, comprising:

determining, for each speech segment, (i) a total duration of the segment, (ii) a total
duration of all voiced regions of the segment, (iii) an average value of the fundamental frequency
contour over all voiced regions of the segment, (iv) a median value of the fundamental frequency
contour over all voiced regions of the segment, and (v) a standard deviation of the fundamental
frequency contour over the whole segment;

setting the median value of the fundamental frequency contour over all voiced regions of
the segment to the average value of the fundamental frequency contour over all voiced regions of
the segment if a number of fundamental frequency samples in the speech segment is less than a
predetermined value;

examining a predetermined number of frames from a beginning point of each speech
segment, and setting the beginning fundamental frequency value to a fundamental frequency
value of the first frame if all fundamental frequency values of the predetermined number of
frames from the beginning point of the speech segment are within a predetermined range;

examining a predetermined number of frames from an ending point of each speech
segment, and setting the ending fundamental frequency value to a fundamental frequency value
of the last frame if all fundamental frequency values of the predetermined number of frames from
the ending point of the speech segment are within a predetermined range;

setting the beginning fundamental frequency and the ending fundamental frequency of
unvoiced speech segments to a value substantially equal to a median value of the fundamental
frequency contour over all voiced regions of a preceding voiced segment; and

calculating, for each pair of adjacent speech segments n and n+1, (i) a first ratio of the nth
ending fundamental frequency value to the (n+1)th beginning fundamental frequency value, (ii) a
second ratio being the inverse of the first ratio, and adjusting the nth ending fundamental
frequency value and the (n+1)th beginning fundamental frequency value only if the first ratio and
the second ratio are less than a predetermined ratio threshold.

The preferred embodiment also provides a method of adjusting a fundamental frequency
contour of each of a series of concatenated speech segments according to a linear function
calculated for each particular speech segment, wherein parameters characterizing each linear
function are selected according to a beginning fundamental frequency value and an ending
fundamental frequency value of the corresponding speech segment, comprising:

calculating the linear function for each individual speech segment according to a coupled
spring model, wherein the coupled spring model is implemented such that a first spring
component couples the beginning fundamental frequency value to an anchor component, a
second spring component couples the ending fundamental frequency value to the anchor
component, and a third spring component couples the beginning fundamental frequency value to
the ending fundamental frequency value; and,

forming a set of simultaneous equations corresponding to the coupled spring models
associated with all of the concatenated speech segments, and solving the set of simultaneous
equations to produce the parameters characterizing each linear function associated with one of
the speech segments.

There is also provided a preferred system for smoothing fundamental frequency
discontinuities at boundaries of concatenated speech segments, each speech segment
characterized by a segment fundamental frequency contour and including two or more frames,
comprising:

means for determining, for each speech segment, a beginning fundamental frequency
value and an ending fundamental frequency value;

means for adjusting the fundamental frequency contour of each of the speech segments
according to a linear function calculated for each particular speech segment, wherein parameters
characterizing each linear function are selected according to the beginning fundamental
frequency value and the ending fundamental frequency value of the corresponding speech
segment.

According to another aspect of the present invention, there is provided a method
according to claim 36.

According to another aspect of the present invention, there is provided a system
according to claim 37.

BRIEF DESCRIPTION OF DRAWINGS 

The foregoing and other aspects of embodiments of this invention may be more fully
understood from the following description of the preferred embodiments, when read together
with the accompanying drawings in which:

FIG. 1 shows a block diagram view of an embodiment of a F 0 adjustment processor for
smoothing fundamental frequency discontinuities across synthesized speech segments;

FIG. 2 shows, in flow-diagram form, the steps performed to determine the beginning
fundamental frequency and the ending fundamental frequency of the speech segments;

FIG. 3A shows the coupled-spring model according to an embodiment of the present
invention prior to adjustments to beginning and ending F 0 values; and,

FIG. 3B shows the coupled-spring model of FIG. 3A after adjustments to beginning
and ending F 0 values.



DESCRIPTION OF THE PREFERRED EMBODIMENTS 

FIG. 1 shows, in the context of a TTS system 100, a block diagram view of one preferred
embodiment of a F 0 adjustment processor 102 for smoothing fundamental frequency
discontinuities across synthesized speech segments. In addition to the F 0 adjustment processor
102, the TTS system 100 includes a unit source database 104, a unit selection processor 106 and
a unit characterization processor 108. The source database 104 includes speech segments (also
referred to as "units" herein) of various lengths, along with associated characterizing data as
described in more detail herein. The unit selection processor 106 receives text data 110 to be
synthesized and selects appropriate units from the source database 104 corresponding to the text
data 110. The unit characterization processor 108 receives the selected speech units from the
unit selection processor 106 and further characterizes each unit with respect to endpoint F 0 (i.e.,
beginning fundamental frequency and ending fundamental frequency), and other parameters as
described herein. The F 0 adjustment processor 102 receives the speech units along with the
associated characterization parameters from the characterization processor 108, and adjusts the F 0






of each unit as described in more detail herein, so as to match the F 0 characteristics at the unit 
boundaries. The F 0 adjustment processor 102 outputs corrected speech segments to a speech
synthesizer 112 which generates and outputs speech. Although these components of the TTS
system 100 are described conceptually herein as individual processors, it should be understood 
that this description is exemplary only, and in other embodiments, these components may be 
implemented in other architectures. For example, all components of the TTS system 100 could 
be implemented in software running on a single computer system. In other embodiments, the 
individual components could be implemented completely in hardware (i.e., application specific 
integrated circuits). 

In preparing the source database 104, the F 0 and voicing state VS (i.e., one of two 
possible states: voiced or unvoiced) of all speech units are estimated using any of several F 0 
tracking algorithms known in the art. One such tracking algorithm is described in "A Robust
Algorithm for Pitch Tracking (RAPT)," by David Talkin, in "Speech Coding and Synthesis,"
W.B. Kleijn & K.K. Paliwal, eds., Elsevier, 1995. These estimates are used to find the "glottal
closure instants" (referred to herein as "GCIs") that occur once per cycle of the F 0 during voiced 
speech, or that occur at periodic locations during the unvoiced speech intervals. The result is, for 
each speech segment, a series of estimates of the voicing state and F 0 at intervals varying between 
about 2 ms and 33 ms, depending on the local F 0. Each estimate, referred to herein as a "frame,"
may be represented as a two-tuple vector (F 0, VS). The majority of these frames will be correct,
but as many as 1% may be quite wrong, where the estimated F 0 and/or voicing state are 
completely wrong. If one of these bad estimates is used to determine the correction function, 
then the result will be seriously degraded synthesis, much worse than would have resulted had no
"correction" been applied. It should be further noted that, since the unit selection process has
already attempted to gather segments from mutually-compatible contexts in the source material, 
it is rare that extreme changes in F 0 will be required to effectively smooth across the speech 
segment boundaries. Finally, the amount of audible degradation in the target due to F 0 
modification is greater as the variation increases, so that extreme F 0 correction may degrade 
rather than improve the result, even if the relevant F 0 estimates are correct. 

The following input parameters are provided to and used by the unit characterization
processor 108, along with the frames and the associated speech segments, to calculate a number
of output parameters:

• MIN_F0        The minimum F 0 allowed in any part of the system.

• RISKY_STD     The number of standard deviations in F 0 variation between
                adjacent F 0 samples allowed before the measurements are
                considered suspect.

• N_ROBUST      The number of F 0 samples required in a segment to establish
                reliable estimates of F 0 mean and median.

• DUR_ROBUST    The duration required before F 0 statistics in the
                segment can be considered to be reliable.

• N_F0_CHECK    The number of adjacent F 0 measurements near the segment
                endpoints which must be within RISKY_STD of one another
                before a single F 0 measurement at the endpoint is accepted as
                the true value of F 0.

• MAX_RATIO     The maximum ratio of F 0 estimates in adjacent segments over
                which smoothing will be attempted.

• [...]         The number of frames in the segment.

• N_F0          The number of voiced frames contained in a segment.

Values of these parameters used in the preferred embodiment are:

• MIN_F0        33.0 Hz
• RISKY_STD     1.5
• N_ROBUST      5
• DUR_ROBUST    0.06 sec.
• N_F0_CHECK    4
• MAX_RATIO     1.8



However, less preferred parameters might fall in the following ranges:

• 20.0 <= MIN_F0 <= 50.0 Hz
• 1.0 <= RISKY_STD <= 2.5
• 3 <= N_ROBUST <= 10
• 0.04 <= DUR_ROBUST <= 0.1 sec
• 3 <= N_F0_CHECK <= 10
• 1.2 < MAX_RATIO <= 3.0

and these should not limit the scope of the invention as defined in the claims.
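
For illustration, the tunable values and ranges listed in this section can be gathered into a
single configuration object. The class name, field names and the `out_of_range` helper are
hypothetical and chosen only for this sketch; the numeric defaults and limits are those listed in
this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CharacterizationParams:
    """Input parameters of the unit characterization step (preferred values)."""
    min_f0: float = 33.0       # MIN_F0, in Hz
    risky_std: float = 1.5     # RISKY_STD, in standard deviations
    n_robust: int = 5          # N_ROBUST, in F0 samples
    dur_robust: float = 0.06   # DUR_ROBUST, in seconds
    n_f0_check: int = 4        # N_F0_CHECK, in frames
    max_ratio: float = 1.8     # MAX_RATIO, dimensionless

    def out_of_range(self) -> List[str]:
        """Return the names of parameters that fall outside the listed ranges."""
        checks = {
            "min_f0": 20.0 <= self.min_f0 <= 50.0,
            "risky_std": 1.0 <= self.risky_std <= 2.5,
            "n_robust": 3 <= self.n_robust <= 10,
            "dur_robust": 0.04 <= self.dur_robust <= 0.1,
            "n_f0_check": 3 <= self.n_f0_check <= 10,
            "max_ratio": 1.2 < self.max_ratio <= 3.0,
        }
        return [name for name, ok in checks.items() if not ok]
```

Since the ranges are stated not to limit the invention, the helper only reports values outside
them and does not reject the configuration.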
The following are the output parameters generated by the characterization processor 108:

• DUR           The duration of the entire segment.

• V_DUR         The total duration of all voiced regions in the segment.

• F0_MEAN       Average F 0 value over all voiced regions in a segment.

• F0_MEDIAN     Median F 0 value over all voiced regions in a segment.

• F0_STD        The standard deviation in F 0 over the whole segment.

• F01           The estimate of F 0 at the beginning of a segment (beginning
                fundamental frequency).

• F02           The estimate of F 0 at the end of a segment (ending
                fundamental frequency).
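
The segment statistics among these output parameters might be computed from the frames as in the
following sketch. Several points are assumptions made for illustration: each frame is a
(F 0, VS) pair with VS given as a boolean, a separate list supplies the duration of each frame in
seconds, F0_STD is taken as the population standard deviation over the voiced frames, and an
entirely unvoiced segment reports zeros. The names `SegmentStats` and `characterize` are
hypothetical.

```python
import statistics
from typing import NamedTuple, Sequence, Tuple


class SegmentStats(NamedTuple):
    dur: float        # DUR
    v_dur: float      # V_DUR
    f0_mean: float    # F0_MEAN
    f0_median: float  # F0_MEDIAN
    f0_std: float     # F0_STD


def characterize(frames: Sequence[Tuple[float, bool]],
                 durations: Sequence[float]) -> SegmentStats:
    """Compute duration and F0 statistics for one segment."""
    if len(frames) != len(durations):
        raise ValueError("one duration is required per frame")
    voiced_f0 = [f0 for f0, voiced in frames if voiced]
    dur = sum(durations)
    v_dur = sum(d for (_, voiced), d in zip(frames, durations) if voiced)
    if not voiced_f0:
        # Entirely unvoiced segment: no F0 statistics are available.
        return SegmentStats(dur, v_dur, 0.0, 0.0, 0.0)
    return SegmentStats(dur, v_dur,
                        statistics.mean(voiced_f0),
                        statistics.median(voiced_f0),
                        statistics.pstdev(voiced_f0))
```

The endpoint estimates F01 and F02 are not computed here, since they depend on the further
checks described for each segment.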



The speech segments (also referred to herein as "units") returned by a typical unit- 
selection algorithm employed by the unit selection processor 106 may consist of one or many 
phones, and the duration of each segment may vary from 30 ms to several seconds. The method and
system described herein are suitable for segments of any length. For each segment to be used in
the target utterance, F01 and F02 are estimated by performing the following steps, illustrated in 
flow-diagram form in FIG. 2: 



1. Set 202 N_F0 to the number of voiced frames in the segment. 

2. Compute 204 DUR and V_DUR of the segment. 

3. Compute 206 F0_MEAN, F0_STD and F0_MEDIAN for the segment. 

4. If the segment is unvoiced (N_F0 equals 0) 208, and no other segments preceding it in 
the target sequence have been voiced 210, skip the remainder of the steps, and proceed to 
the next segment at step 1. 

5. If (N_F0 equals 0) 208, but this segment is preceded by one or more segments containing 
voicing 210, use the last estimate of F0_MEDIAN as both F01 and F02 for this segment 
214, then go on to the next segment at step 1. 

6. If N_F0 is less than N_ROBUST 216, set F0_MEDIAN for the segment to its 
F0_MEAN 218. 

7. Starting at the beginning of the segment, examine the first N_F0_CHECK frames. If 
they are all voiced 220, and if their F 0 measurements all fall within (RISKY_STD * 
F0_STD) of the following frame's measurement 222, set F01 to the first F 0 measurement 
in the segment 224, then go to step 10, else go to step 8. 

8. If V_DUR is less than DUR_ROBUST or N_F0 is less than N_ROBUST 226, set F01 
to F0_MEDIAN for the segment 228, then go to step 10, else go to step 9. 

9. Starting at the beginning of the segment, find the first N_ROBUST F 0 measurements 
(voiced frames). Set F01 to the mean of F 0 found in these frames 230. 

10. Starting at the end (last frame) of the segment, examine the last N_F0_CHECK frames. 
If they are all voiced 232, and if their F 0 measurements all fall within (RISKY_STD * 
F0_STD) of the preceding frame's measurement 234, set F02 to the last F 0 measurement 
in the segment 236, then go to step 1 for the next segment, else go to step 11. 

11. If V_DUR is less than DUR_ROBUST or N_F0 is less than N_ROBUST 238, set F02 
to F0_MEDIAN for the segment 240, then go to step 1 for the next segment, else go to 
step 12. 

12. Starting at the end of the segment, find the last N_ROBUST F 0 measurements (voiced 
frames). Set F02 to the mean of F 0 found in these frames. Go to step 1 for the next 
segment. 

At the end of these steps, M, DUR, V_DUR, F01 and F02 are known for all segments comprising 
the target utterance. These values can be subscripted to indicate their dependence upon the 
segment, as is shown in the examples herein. 
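
The endpoint-estimation steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the characterization processor 108: the function name estimate_endpoints is hypothetical, an unvoiced frame is assumed to be marked by an F 0 of 0.0, frames are assumed to be equally spaced by frame_dur seconds, F0_STD is taken as the population standard deviation over the voiced frames, and the "adjacent measurement" test is read as a comparison of consecutive frames among the N_F0_CHECK frames nearest the endpoint.

```python
import statistics


def estimate_endpoints(f0, frame_dur, prev_median=None, n_robust=5,
                       dur_robust=0.06, n_f0_check=4, risky_std=1.5):
    """Return (F01, F02, F0_MEDIAN) for one segment.

    f0 holds one value per frame in Hz; 0.0 marks an unvoiced frame
    (an assumed convention). prev_median is the F0_MEDIAN of the most
    recent voiced segment, or None if there has been none.
    """
    voiced = [v for v in f0 if v > 0]
    n_f0 = len(voiced)                          # step 1
    v_dur = n_f0 * frame_dur                    # step 2 (DUR is len(f0) * frame_dur)
    if n_f0 == 0:                               # steps 4 and 5
        return prev_median, prev_median, prev_median
    f0_mean = statistics.fmean(voiced)          # step 3
    f0_std = statistics.pstdev(voiced)
    f0_median = statistics.median(voiced)
    if n_f0 < n_robust:                         # step 6
        f0_median = f0_mean
    weak = v_dur < dur_robust or n_f0 < n_robust

    def one_end(frames):
        # frames is ordered so that frames[0] is the endpoint of interest.
        head = frames[:n_f0_check]
        steady = (len(head) == n_f0_check
                  and all(v > 0 for v in head)
                  and all(abs(a - b) <= risky_std * f0_std
                          for a, b in zip(head, head[1:])))
        if steady:                              # steps 7 and 10
            return head[0]
        if weak:                                # steps 8 and 11
            return f0_median
        nearest = [v for v in frames if v > 0][:n_robust]
        return statistics.fmean(nearest)        # steps 9 and 12

    return one_end(f0), one_end(f0[::-1]), f0_median
```

The ending estimate reuses the beginning-of-segment logic on the reversed frame list, which mirrors the symmetry between steps 7 to 9 and steps 10 to 12. When the segment is unvoiced and no earlier segment was voiced, the function returns None values, corresponding to skipping the segment at step 4.
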

As a final step before actually computing the correction functions, a check is made on the 
reasonableness of matching F 0 across the segment boundaries. If 

F02(n) / F01(n+1) > MAX_RATIO 

or 

F01(n+1) / F02(n) > MAX_RATIO , 

then that boundary is marked to indicate that the F 0 endpoint values on either side should be left 
unchanged. This is useful for two reasons. First, large alterations to F 0 will result in unnatural- 
sounding speech, even if the estimates for F02(n) and F01(n+1) are reasonable. Second, it is 
relatively rare that large ratios are encountered, so when one is found, the likely cause is that the 
F 0 tracker has made an error. In both cases, it is prudent to leave these endpoints unchanged. 
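
The boundary check can be illustrated with a short Python sketch. The function name frozen_boundaries is hypothetical, and the endpoint estimates are assumed to be strictly positive so that both ratios are defined.

```python
def frozen_boundaries(f01, f02, max_ratio=1.8):
    """Flag the joins whose endpoint values should be left unchanged.

    f01[n] and f02[n] are the beginning and ending F0 estimates of
    segment n (assumed positive). Entry n of the result refers to the
    join between segment n and segment n + 1.
    """
    frozen = []
    for n in range(len(f01) - 1):
        end_prev, start_next = f02[n], f01[n + 1]
        # Testing both ratios is the same as testing the larger of the two.
        ratio = max(end_prev / start_next, start_next / end_prev)
        frozen.append(ratio > max_ratio)
    return frozen
```

For example, with ending values of 110 Hz and 120 Hz followed by beginning values of 105 Hz and 250 Hz, only the second join (a ratio of about 2.08) would be marked as unalterable.
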

The next part of the process modifies the F 0 of the original speech segments by applying 
relatively simple correction functions, which are unlikely to significantly alter the prosody of the 
original material. The term "prosody," as used herein, refers to variations in stress, pitch, and 
rhythm of speech by which different shades of meaning are conveyed. Using a simple low-pass 
filter to modify the F 0 contours in an attempt to smooth across the boundaries produces two 
undesirable results. First, some of the natural variation in the speech will be lost. Second, a 
local variation due to the F 0 discontinuity at the segment boundary will still be retained, and will 
constitute "noise" in the prosody. The method described herein adds simple linear functions, or 
at least substantially linear functions, to the original segment F 0 contours to enforce F 0 continuity 
across the joins while retaining the original details of relative F 0 variation largely unchanged, 
except for overall raising or lowering, or the introduction of slight changes in overall slope. The 
proposed method favors introducing offsets to short segments over long segments, and 
discourages large changes in overall slope for all segments. We will now describe one possible 
embodiment of the idea that employs a coupled-spring model to satisfy the constraints. 

The coupled-spring model is shown in FIGs. 3A and 3B. FIG. 3A depicts a series of 
segments S(n) to be concatenated, of respective durations DUR(n) in time, with estimated endpoint 
F 0 values F01(n) and F02(n) "attached" to the springs, which tend to resist changes in the endpoints. 
The coupled-spring model includes three spring components for each speech segment. The first 
spring component couples the beginning fundamental frequency value F01(n) to an anchor 
component 310 (i.e., a fixed reference with respect to the segments), a second spring component 
couples the ending fundamental frequency value F02(n) to the anchor component, and a third 
spring component couples the beginning fundamental frequency value F01(n) to the ending 
fundamental frequency value F02(n). The constants of proportionality of the various spring 
components are indicated as k(n). These endpoint values are adjusted to be equal where the 
segments connect. d1(n) is the correction (or displacement) applied to F01(n), and d2(n) is the 
correction applied to F02(n), for all segments in the utterance, n = 1, ..., N. F 0 values between 
the endpoints in each segment will have a correction value applied that is linearly interpolated 
between d1(n) and d2(n). Thus, the correction function will be a straight line with intercept and 
slope determined for each segment. The values for d1(n) and d2(n) are determined for the whole 
utterance by the coupling of springs as shown in FIG. 3B. At each segment endpoint, a vertically 
oriented spring resists change in F 0 with a spring constant k(n) which is proportional to the 
duration of voicing in the segment, so that long voiced segments will have a "stiffer" vertical 
spring than short, or less voiced, segments. 

k(n) = V_DUR(n) * KD , 

where KD is the constant of proportionality. The forces which resist changes in F 0 will be 
denoted G, with 

Gv1(n) = k(n) * d1(n) 

and 

Gv2(n) = k(n) * d2(n) . 

The horizontally-oriented springs in FIGs. 3A and 3B represent the non-linear restoring force 
that resists changes in slope. The displacements at the endpoints, d1(n) and d2(n), are 
constrained to be strictly vertical, so that any difference in the endpoint vertical displacements 
will result in a stretching of the horizontal spring. An effective length, l(n), is assigned to each 
segment using the relation 

l(n) = DUR(n) * LD , 

where LD is the constant relating total segment duration in seconds to effective mechanical 
length for the purpose of the spring model. The length, L(n), of the "horizontal" spring will be 
greater than, or equal to, l(n), depending on the difference in the endpoint displacements for the 
segment. Let 

D(n) = d2(n) - d1(n) , 

then, by simple geometry: 

L(n) = sqrt( D(n)^2 + l(n)^2 ) . 

The tension in the "horizontal" spring can be resolved into its horizontal and vertical 
components. We are only concerned with the vertical components, 

Gt1(n) = -KT * D(n) * [1 - l(n) / L(n)] , 

and 

Gt2(n) = -Gt1(n) . 

KT is the spring constant for all horizontal springs, and is identical for all segments. Finally, the 
total vertical forces on the segment endpoints are 

G1(n) = Gv1(n) + Gt1(n) , 

and 

G2(n) = Gv2(n) + Gt2(n) . 

For small changes in slope, Gt is small, but grows rapidly as the slope increases. For segments 
containing little or no voicing, Gv is small, but Gt remains in effect to couple, at least weakly, 
the F 0 values of segments on either side. 
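
The force definitions for a single segment can be collected into one Python function, shown here as a sketch. The function name segment_forces is hypothetical; the default values of KD, KT and LD are those of the preferred embodiment, and dur is assumed to be greater than zero so that L(n) is never zero.

```python
import math


def segment_forces(d1, d2, v_dur, dur, KD=1.0, KT=1.0, LD=1000.0):
    """Return (G1, G2), the total vertical forces at the two endpoints.

    d1, d2 : endpoint displacements in Hz
    v_dur  : duration of voicing in the segment, in seconds
    dur    : total segment duration in seconds (assumed > 0)
    """
    k = v_dur * KD                      # vertical spring constant k(n)
    l = dur * LD                        # effective length l(n)
    D = d2 - d1                         # difference in endpoint displacements
    L = math.hypot(D, l)                # stretched length L(n)
    gt1 = -KT * D * (1.0 - l / L)       # Gt1(n); Gt2(n) is its negative
    g1 = k * d1 + gt1                   # G1(n) = Gv1(n) + Gt1(n)
    g2 = k * d2 - gt1                   # G2(n) = Gv2(n) + Gt2(n)
    return g1, g2
```

When D(n) is much smaller than l(n), the factor 1 - l(n)/L(n) is approximately D(n)^2 / (2 * l(n)^2), so the slope term grows roughly as the cube of the endpoint difference. This is consistent with Gt being small for small slope changes and growing rapidly thereafter.
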

The coupling comes about by requiring that 

d2(n) - d1(n+1) = F01(n+1) - F02(n) 

and 

G2(n) + G1(n+1) = 0 , 

for all segments n = 1, ..., N-1 in the utterance, except at the boundaries of the utterance, 
where 

G1(1) = 0 , 

and 

G2(N) = 0 . 

The set of simultaneous non-linear equations is solved using an iterative algorithm. It is based 
on Newton's method of finding zeros of a function. Since the sum of forces at each junction 
must be made zero, the solution is approached by computing the derivatives of these sums with 
respect to the displacements at each junction, and using Newton's re-estimation formula to arrive 
at converging values for the displacements. As described herein, some segment endpoints were 
marked as unalterable because MAX_RATIO was exceeded across the boundary. The 
displacements of those endpoints will be held at zero. The iteration is carried out over all 
segments simultaneously, and continues until the absolute value of the ratio of (a) the sum of 
forces at each node to (b) their difference is a sufficiently small fraction. In one embodiment the 
ratio should be less than or equal to 0.1 before the iteration stops, but other fractions may also be 
used to provide different performance. In practice, a typical utterance of 25 segments will require 
10-20 iterations to converge. This does not represent a significant computational overhead in the 
context of TTS. 
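
One way such a Newton iteration might be organized is sketched below in Python. This is an illustrative sketch under several assumptions, not the iteration used in the embodiment: the continuity constraint is used to eliminate d2(n) at each join, leaving one unknown per node; every segment is assumed to contain some voicing, so that k(n) > 0; boundaries marked as unalterable are not handled; convergence is declared when the largest net force falls below a tolerance, rather than by the ratio criterion described above; and a step-halving safeguard is added. The names solve_displacements, _forces and _solve_tridiagonal are hypothetical.

```python
import math


def _forces(u, c, k, l, KT):
    """Net force at each node, plus the tridiagonal matrix of its derivatives."""
    N = len(k)
    R, diag, off = [0.0] * (N + 1), [0.0] * (N + 1), [0.0] * N
    for n in range(N):
        d1, d2 = u[n], u[n + 1] + c[n]
        D = d2 - d1
        L = math.hypot(D, l[n])
        h = D * (1.0 - l[n] / L)                    # slope-spring pull divided by KT
        s = 1.0 - l[n] / L + l[n] * D * D / L**3    # derivative of h with respect to D
        R[n] += k[n] * d1 - KT * h                  # G1(n)
        R[n + 1] += k[n] * d2 + KT * h              # G2(n)
        diag[n] += k[n] + KT * s
        diag[n + 1] += k[n] + KT * s
        off[n] -= KT * s                            # couples node n and node n + 1
    return R, diag, off


def _solve_tridiagonal(diag, off, rhs):
    """Thomas algorithm for a symmetric tridiagonal system."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = off[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - off[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = off[i] / m
        dp[i] = (rhs[i] - off[i - 1] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_displacements(f01, f02, v_dur, dur, KD=1.0, KT=1.0, LD=1000.0,
                        tol=1e-9, max_iter=50):
    """Return (d1, d2), lists of endpoint displacements per segment."""
    N = len(f01)
    k = [v * KD for v in v_dur]
    l = [d * LD for d in dur]
    # c[n] is the jump d2(n) - d1(n+1) required for continuity at join n.
    c = [f01[n + 1] - f02[n] for n in range(N - 1)] + [0.0]
    u = [0.0] * (N + 1)        # u[n] = d1(n) for n < N, and u[N] = d2 of the last segment
    worst = lambda forces: max(abs(f) for f in forces)
    for _ in range(max_iter):
        R, diag, off = _forces(u, c, k, l, KT)
        if worst(R) < tol:
            break
        step = _solve_tridiagonal(diag, off, R)
        t = 1.0
        while True:            # halve the step until the net forces shrink
            trial = [ui - t * si for ui, si in zip(u, step)]
            if t < 1e-4 or worst(_forces(trial, c, k, l, KT)[0]) < worst(R):
                break
            t *= 0.5
        u = trial
    return u[:-1], [u[n + 1] + c[n] for n in range(N)]
```

With a short segment (0.05 sec) followed by a long one (0.5 sec) and a 10 Hz gap at the join, this sketch moves the end of the short segment by far more than the start of the long segment, which matches the stated preference for offsetting short segments.
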

The model parameters used in one preferred embodiment are: 

• KD 1.0 
• KT 1.0 
• LD 1000.0 

However, less preferred model parameters might fall in the ranges: 

• 0.001 <= KD <= 10.0 
• 0.001 <= KT <= 10.0 
• 1.0 <= LD <= 10000.0 

and these should not limit the scope of the invention as defined in the claims. 

By adjusting these parameter values, it is possible to alter the behavior of the model to 
best suit the characteristics of a particular talker, speaking style or language. However, the 
values listed work well for a range of talkers and languages. Increasing LD will make the onset 
of the highly non-linear term in the slope restoring force less abrupt. Increasing KD relative to 
KT will encourage slope change more, and overall segment offset less. Large values of KT 
relative to KD will encourage overall segment offset rather than slope change. 

Once the coupled-spring equations have been solved, the displacements d1(n) and d2(n) 
may be used to correct the endpoint F 0 values. If the original F 0 values for the segment were 
F0(n,i), and each segment starts at time t0(n), and the frames occur at times t(n,i), then the n-th 
segment's corrected F 0 values, given by F0'(n,i) for all M(n) frames i = 1, ..., M(n), are 

F0'(n,i) = F0(n,i) + d1(n) + [(d2(n) - d1(n)) * (t(n,i) - t0(n)) / DUR(n)] . 

If F0'(n,i) is less than MIN_F0 for any frame, then F0'(n,i) is set to MIN_F0. These corrections 
are only applied to voiced frames. Nothing is changed in the unvoiced frames. In FIG. 3B, these 
modified segments are labeled S*(n). 
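
The per-frame correction can be sketched in Python as follows. The function name apply_correction is hypothetical, and an unvoiced frame is assumed to be marked by an F 0 value of zero or less.

```python
def apply_correction(f0, times, t0, dur, d1, d2, min_f0=33.0):
    """Return the corrected F0 contour of one segment.

    f0    : per-frame F0 values in Hz (<= 0 marks an unvoiced frame, by assumption)
    times : frame times t(n,i) in seconds
    t0    : start time t0(n) of the segment
    dur   : DUR(n), the segment duration in seconds
    d1,d2 : endpoint displacements d1(n) and d2(n) in Hz
    """
    corrected = []
    for value, t in zip(f0, times):
        if value <= 0:
            corrected.append(value)         # unvoiced frames are left unchanged
            continue
        # Linear interpolation between d1 (segment start) and d2 (segment end).
        new_value = value + d1 + (d2 - d1) * (t - t0) / dur
        corrected.append(max(new_value, min_f0))
    return corrected
```

For instance, a flat 100 Hz contour with d1 = 10 and d2 = -10 becomes a line falling from 110 Hz at the start of the segment to 90 Hz at its end.
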

Various prior art methods exist for synthesizing the target utterance's waveform with the 
modified F 0 values. These include Pitch Synchronous Overlap and Add (PSOLA), Multi-band 
Resynthesis using Overlap and Add (MBROLA), sinusoidal waveform coding, harmonics+noise 
models, and various Linear Predictive Coding (LPC) methods, especially Residual Excited 
Linear Prediction (RELP). References to all of these are easily found in the speech coding and 
synthesis literature known to those in the art. 

The invention may be embodied in other specific forms without departing from the scope 
of the invention as defined in the claims. The present embodiments are therefore to be 
considered in all respects as illustrative and not restrictive, the scope of the invention being 
indicated by the appended claims rather than by the foregoing description, and all changes which 
come within the meaning and range of the equivalency of the claims are therefore intended to be 
embraced therein. While some claims use the term "linear function" in the context of this 
invention, a substantially linear function or a non-linear function capable of having the desired 
effect would be adequate. Therefore the claims should not be interpreted on their strict literal 
meaning. 
Claims: 

1. A method of smoothing fundamental frequency discontinuities at boundaries of 
concatenated speech segments, each speech segment characterized by a segment fundamental 
frequency contour and including two or more frames, comprising: 

determining, for each speech segment, a beginning fundamental frequency value and an 
ending fundamental frequency value; 

adjusting the fundamental frequency contour of each of the speech segments according to 
a predetermined function calculated for each particular speech segment, wherein parameters 
characterizing each predetermined function are selected according to the beginning fundamental 
frequency value and the ending fundamental frequency value of the corresponding speech 
segment. 

2. A method according to claim 1, wherein the predetermined function adjusts a slope 
associated with the speech segment. 

3. A method according to claim 1, wherein the predetermined function adjusts an offset 
associated with the speech segment. 

4. A method according to claim 1, wherein the predetermined function includes a linear 
function. 

5. A method according to claim 1, wherein the predetermined function calculated for each 
particular speech segment is dependent upon a length associated with the speech segment, such 
that the predetermined function adjusts longer segments more than shorter segments. 

6. A method according to claim 1, further including determining, for each speech segment, 
one or more parameters selected from: (i) a total duration of the segment; (ii) a total duration of 
all voiced regions of the segment; (iii) an average value of the fundamental frequency contour 
over all voiced regions of the segment; (iv) a median value of the fundamental frequency contour 
over all voiced regions of the segment; and (v) a standard deviation of the fundamental frequency 
contour over the whole segment. 

7. A method according to claim 6, further including setting the determined median value of 
the fundamental frequency contour over all voiced regions of the segment to the average value of 
the fundamental frequency contour over all voiced regions of the segment if a number of 
fundamental frequency samples in the speech segment is less than a predetermined value. 

8. A method according to any preceding claim, further including examining a predetermined 
number of frames from a beginning point of each speech segment, and setting the beginning 
fundamental frequency value to a fundamental frequency value of the first frame if all 
fundamental frequency values of the predetermined number of frames from the beginning point 
of the speech segment are within a predetermined range. 

9. A method according to any preceding claim, further including examining a predetermined 
number of frames from an ending point of each speech segment, and setting the ending 
fundamental frequency value to a fundamental frequency value of the last frame if all 
fundamental frequency values of the predetermined number of frames from the ending point of 
the speech segment are within a predetermined range. 

10. A method according to any preceding claim, further including setting the beginning 
fundamental frequency and the ending fundamental frequency of unvoiced speech segments to a 
value substantially equal to a median value of the fundamental frequency contour over all voiced 
regions of a preceding voiced segment. 

11. A method according to any preceding claim, further including calculating, for each pair of 
adjacent speech segments n and n+1, one or more of: (i) a first ratio of the n-th ending fundamental 
frequency value to the (n+1)-th beginning fundamental frequency value; and (ii) a second ratio being 
the inverse of the first ratio; and adjusting the n-th ending fundamental frequency value and the 
(n+1)-th beginning fundamental frequency value only if the first ratio and/or the second ratio are 
less than a predetermined ratio threshold. 

12. A method according to any preceding claim, further including calculating the function for 
each individual speech segment according to a coupled spring model. 

13. A method according to claim 12, further including implementing the coupled spring 
model such that a first spring component couples the beginning fundamental frequency value to 
an anchor component, a second spring component couples the ending fundamental frequency 
value to the anchor component, and a third spring component couples the beginning fundamental 
frequency value to the ending fundamental frequency value. 

14. A method according to claim 13, further including associating a spring constant with the 
first spring and the second spring such that the spring constant is proportional to a duration of 
voicing in the associated speech segment. 

15. A method according to claim 13 or 14, further including associating a spring constant 
with the third spring such that the third spring models a non-linear restoring force that resists a 
change in slope of the segment fundamental frequency contour. 

16. A method according to any of claims 12-15, further including forming a set of 
simultaneous equations corresponding to the coupled spring models associated with all of the 
concatenated speech segments, and solving the set of simultaneous equations to produce the 
parameters characterizing each linear function associated with one of the speech segments. 

17. A method according to claim 16, further including solving the set of simultaneous 
equations through an iterative algorithm based on Newton's method of finding zeros of a 
function. 

18. A system for smoothing fundamental frequency discontinuities at boundaries of 
concatenated speech segments, each speech segment characterized by a segment fundamental 
frequency contour and including two or more frames, comprising: 

a unit characterization processor for receiving the speech segments and characterizing 
each segment with respect to a beginning fundamental frequency and an ending fundamental 
frequency; 

a fundamental frequency adjustment processor for receiving the speech segments, the 
beginning fundamental frequency and ending fundamental frequency, and for adjusting the 
fundamental frequency contour of each of the speech segments according to a predetermined 
function calculated for each particular speech segment, wherein parameters characterizing each 
predetermined function are selected according to the beginning fundamental frequency value and 
the ending fundamental frequency value of the corresponding speech segment. 

19. A system according to claim 18, wherein the predetermined function adjusts a slope 
associated with the speech segment. 

20. A system according to claim 18, wherein the predetermined function adjusts an offset 
associated with the speech segment. 

21. A system according to claim 18, wherein the predetermined function includes a linear 
function. 

22. A system according to claim 18, wherein the predetermined function calculated for each 
particular speech segment is dependent upon a length associated with the speech segment, such 
that the predetermined function adjusts longer segments more than shorter segments. 

23. A system according to claim 18, wherein the unit characterization processor determines, 
for each speech segment, one or more of: (i) a total duration of the segment; (ii) a total duration of 
all voiced regions of the segment; (iii) an average value of the fundamental frequency contour 
over all voiced regions of the segment; (iv) a median value of the fundamental frequency contour 
over all voiced regions of the segment; and (v) a standard deviation of the fundamental frequency 
contour over the whole segment. 

24. A system according to claim 23, wherein the unit characterization processor sets the 
determined median value of the fundamental frequency contour over all voiced regions of the 
segment to the average value of the fundamental frequency contour over all voiced regions of the 
segment if a number of fundamental frequency samples in the speech segment is less than a 
predetermined value. 

25. A system according to any of claims 18-24, wherein the unit characterization processor 
examines a predetermined number of frames from a beginning point of each speech segment, and 
sets the beginning fundamental frequency value to a fundamental frequency value of the first 
frame if all fundamental frequency values of the predetermined number of frames from the 
beginning point of the speech segment are within a predetermined range. 

26. A system according to any of claims 18-25, wherein the unit characterization processor 
examines a predetermined number of frames from an ending point of each speech segment, and 
sets the ending fundamental frequency value to a fundamental frequency value of the last frame 
if all fundamental frequency values of the predetermined number of frames from the ending point 
of the speech segment are within a predetermined range. 

27. A system according to any of claims 18-26, wherein the unit characterization processor 
sets the beginning fundamental frequency and the ending fundamental frequency of unvoiced 
speech segments to a value substantially equal to a median value of the fundamental frequency 
contour over all voiced regions of a preceding voiced segment. 

28. A system according to any of claims 18-27, wherein the unit characterization processor 
calculates, for each pair of adjacent speech segments n and n+1, one or more of: (i) a first ratio of 
the n-th ending fundamental frequency value to the (n+1)-th beginning fundamental frequency value; 
and (ii) a second ratio being the inverse of the first ratio, and adjusts the n-th ending fundamental 
frequency value and the (n+1)-th beginning fundamental frequency value only if the first ratio 
and/or the second ratio are less than a predetermined ratio threshold. 

29. A system according to any of claims 18-28, wherein the fundamental frequency 
adjustment processor calculates the linear function for each individual speech segment according 
to a coupled spring model. 

30. A system according to claim 29, wherein the fundamental frequency adjustment 
processor implements the coupled spring model such that a first spring component couples the 
beginning fundamental frequency value to an anchor component, a second spring component 
couples the ending fundamental frequency value to the anchor component, and a third spring 
component couples the beginning fundamental frequency value to the ending fundamental 
frequency value. 

31. A system according to claim 30, wherein the fundamental frequency adjustment 
processor associates a spring constant with the first spring and the second spring such that the 
spring constant is proportional to a duration of voicing in the associated speech segment. 

32. A system according to claim 30, wherein the fundamental frequency adjustment 
processor associates a spring constant with the third spring such that the third spring models a 
non-linear restoring force that resists a change in slope of the segment fundamental frequency 
contour. 

33. A system according to any of claims 29-32, wherein the fundamental frequency 
adjustment processor forms a set of simultaneous equations corresponding to the coupled spring 
models associated with all of the concatenated speech segments, and solves the set of 
simultaneous equations to produce the parameters characterizing each linear function associated 
with one of the speech segments. 

34. A system according to claim 33, wherein the fundamental frequency adjustment 
processor solves the set of simultaneous equations through an iterative algorithm based on 
Newton's method of finding zeros of a function. 

36. A method of smoothing fundamental frequency discontinuities at boundaries of 
concatenated speech segments, each speech segment characterized by a segment fundamental 
frequency contour and including two or more frames, comprising: 

adjusting the fundamental frequency contour of each speech segment according to a 
predetermined function calculated for each particular speech segment, wherein the predetermined 
function is dependent upon a length associated with the speech segment, such that the 
predetermined function adjusts longer segments more than shorter segments. 

37. A system for smoothing fundamental frequency discontinuities at boundaries of 
concatenated speech segments, each speech segment characterized by a segment fundamental 
frequency contour and including two or more frames, comprising: 

a fundamental frequency adjustment processor for adjusting the fundamental frequency 
contour of each speech segment according to a predetermined function calculated for each 
particular speech segment, wherein the predetermined function is dependent upon a length 
associated with the speech segment, such that the predetermined function adjusts longer 
segments more than shorter segments. 

ABSTRACT 



A method of smoothing fundamental frequency discontinuities at boundaries of 
concatenated speech segments includes determining, for each speech segment, a beginning 
fundamental frequency value and an ending fundamental frequency value. The method further 
includes adjusting the fundamental frequency contour of each of the speech segments according 
to a linear function calculated for each particular speech segment, and dependent on the 
beginning and ending fundamental frequency values of the corresponding speech segment. The 
method calculates the linear function for each speech segment according to a coupled spring 
model with three springs for each segment. A first spring constant, associated with the first 
spring and the second spring, is proportional to a duration of voicing in the associated speech 
segment. A second spring constant, associated with the third spring, models a non-linear 
restoring force that resists a change in slope of the segment fundamental frequency contour. 